CLaRK - an XML-based System for Corpora Development
Authors
Abstract
In this paper we describe the architecture and the intended applications of the CLaRK System. The development of the CLaRK System started under the Tübingen-Sofia International Graduate Programme in Computational Linguistics and Represented Knowledge (CLaRK). The main aim behind the design of the system is to minimize human intervention during the creation of corpora. Corpus creation is still an important task for the majority of languages, such as Bulgarian, where the effort invested in such development is very modest in comparison with more intensively studied languages like English, German and French. We consider the corpus creation task to comprise the editing, manipulation, searching and transformation of documents. Some of these tasks will be done for a single document or a set of documents, others will be done on a part of a document. Besides the efficiency of the corresponding processing at each stage of the work, the most important investment is human labor. Thus, in our view, the design of the system has to be directed towards minimizing human work. For document management, storage and querying we chose XML technology because of its popularity and its ease of understanding. Very soon XML technology will be a part of our lives and it will be the predominant language for data description and exchange on the Internet. Moreover, many already developed standards for corpus description, such as [XCES, 2001] and [TEI, 2001], have already been adapted to the XML requirements. The core of the CLaRK System is an XML Editor, which is the main interface to the system. With the help of the editor the user can create, edit or browse XML documents. To facilitate corpus management, we extend the XML inventory with facilities that support linguistic work.
We added the following basic language processing modules: a tokenizer with a module that supports a hierarchy of token types; a finite-state engine that supports the writing of cascaded finite-state grammars, together with facilities that search for finite-state patterns; the XPath query language, which supports navigation over the whole set of mark-up of a document; and mechanisms for imposing constraints over XML documents, which are applicable in the context of some events. We envisage several uses for our system:
Similar resources
The CLaRK System Tools XML-based Corpora Development
CLaRK is an XML-based software system for corpora development. It incorporates several technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents. The basic components of the system are: a tagger, a concordancer, an extractor, a grammar processor, a constraint engine.
Development of Corpora within the CLaRK System: The BulTreeBank Project Experience
CLaRK is an XML-based software system for corpora development. It incorporates several technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents. The basic components of the system are: a tagger, a concordancer, an extractor, a grammar processor, a constraint engine.
The CLaRK System: XML-based Corpora Development System for Rapid Prototyping
The paper presents the CLaRK System as a tool for the creation of XML-based corpora and a platform for rapid prototyping. The system provides a set of basic tools for processing XML documents. These tools include: tokenizers, regular grammars, constraints; remove, insert, extract, sort, transformation operations. Additionally, the system is equipped with a macro language which allows the creati...
Cascaded Regular Grammars over XML Documents
The basic mechanism of CLaRK for linguistic processing of text corpora is the cascaded regular grammar processor. The main challenge for the grammars in question is how to apply them to an XML encoding of the linguistic information. The system offers a solution that uses the XPath language to construct the input word for the grammar and an XML encoding of the categories of the recognized words.
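The idea in this abstract can be illustrated with a minimal sketch. This is not CLaRK's implementation or rule syntax: the toy document, the `pos` attribute, the single-letter tag set, the `NP_RULE` pattern and the `mark_noun_phrases` helper are all assumptions made for illustration. An XPath-style expression selects the token elements, their tags are concatenated into the input word, a regular expression plays the role of one grammar rule, and each recognized run is wrapped in a new element that encodes its category.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy document: each <w> token carries a one-letter part-of-speech tag.
DOC = '<s><w pos="D">the</w><w pos="A">old</w><w pos="N">house</w><w pos="V">stands</w></s>'

# Hypothetical rule: optional determiner, any number of adjectives, then a noun.
NP_RULE = re.compile(r"D?A*N")

def mark_noun_phrases(sentence):
    """Wrap each run of tokens whose tag sequence matches NP_RULE in an <NP> element."""
    tokens = sentence.findall("./w")  # XPath-style selection of the input elements
    # One letter per token, so string offsets equal token indices (an assumption).
    word = "".join(t.get("pos") for t in tokens)
    # Go right to left so that rewriting one run does not shift earlier positions.
    for m in reversed(list(NP_RULE.finditer(word))):
        run = tokens[m.start():m.end()]
        position = list(sentence).index(run[0])
        np = ET.Element("NP")  # XML encoding of the recognized category
        for t in run:
            sentence.remove(t)
            np.append(t)
        sentence.insert(position, np)
    return sentence

root = mark_noun_phrases(ET.fromstring(DOC))
result = ET.tostring(root, encoding="unicode")
print(result)
```

A further level of a cascade could, under the same assumptions, build its input word from the new `NP` elements and the remaining `w` elements and apply another rule to that sequence.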
memasysco: XML schema based metadata management system for speech corpora
The metadata management system for speech corpora “memasysco” has been developed at the Institut für Deutsche Sprache (IDS) and is applied for the first time to document the speech corpus “German Today”. memasysco is based on a data model for the documentation of speech corpora and contains two generic XML schemas that drive data capture, XML native database storage, dynamic publishing, and inf...